
Cythonized GroupBy Fill #19673

Merged - 25 commits, merged Feb 25, 2018
Conversation

@WillAyd (Member) commented Feb 13, 2018

I am not a fan of how I've implemented this in groupby.py, but I think this change highlights even more the need for some refactoring of that module, specifically in how Cython transformations are getting dispatched. This works in the meantime, but I'm open to any feedback on how the methods are getting wired back to the Cython layer.

Below are ASV benchmark results for the change:

       before           after         ratio
     [d9551c8e]       [5e007f86]
+      81.6±0.3μs         96.7±2μs     1.19  groupby.GroupByMethods.time_method('float', 'count')
+         455±3μs         516±20μs     1.13  groupby.GroupByMethods.time_method('float', 'cummin')
+       306±0.6μs          340±5μs     1.11  groupby.GroupByMethods.time_method('float', 'prod')
-       142±0.4ms          280±6μs     0.00  groupby.GroupByMethods.time_method('int', 'bfill')
-        230±10ms          327±7μs     0.00  groupby.GroupByMethods.time_method('float', 'bfill')
-       135±0.8ms        189±0.3μs     0.00  groupby.GroupByMethods.time_method('int', 'ffill')
-         227±2ms        184±0.6μs     0.00  groupby.GroupByMethods.time_method('float', 'ffill')


sorted_labels = np.argsort(labels)
if method == 'bfill':
sorted_labels[::-1].sort()
Contributor:

Why are the labels reversed before sorting?

Member Author:

They only get reversed in the case of bfill - it's the cheapest way to fill backwards that I could think of

Contributor:

Right, but doesn't the .sort() undo the reversing?

Member Author:

I took that from the SO question below - the comment by perimosocordiae was helpful in figuring out what's going on. Open to changing it to sorted_labels = np.sort(sorted_labels)[::-1] for readability if you'd like.

https://stackoverflow.com/questions/26984414/efficiently-sorting-a-numpy-array-in-descending-order

Contributor:

Oh, thanks! Perhaps too clever, though - how about this? It still avoids the copy:

sorted_labels.sort()
sorted_labels = sorted_labels[::-1]
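For reference, a quick standalone NumPy check (not the PR code) that both variants sort descending without copying the underlying data:

```python
import numpy as np

a = np.array([3, 1, 2])
a[::-1].sort()   # sorts the reversed view ascending in place, so `a` ends up descending
print(a)         # [3 2 1]

b = np.array([3, 1, 2])
b.sort()         # ascending, in place
b = b[::-1]      # the reversed slice is a view - still no copy of the data
print(b)         # [3 2 1]
```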

Member Author:

Actually I think you are on to something - the extra sort doesn't feel right and I think it could cause a bug as it will really just return an array ranging from N-1..0. Going to work up some test cases to validate and add fix to next commit
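A small demonstration of the suspected bug (standalone sketch, not the PR code): descending-sorting the argsort output just yields N-1..0 regardless of the labels, whereas reversing the argsort keeps rows grouped:

```python
import numpy as np

labels = np.array([0, 1, 0, 1])
sorted_labels = np.argsort(labels, kind='stable')  # [0, 2, 1, 3]: group 0 rows, then group 1 rows

buggy = sorted_labels.copy()
buggy[::-1].sort()   # descending sort of a permutation of 0..N-1
print(buggy)         # [3 2 1 0] - independent of labels, grouping is lost

fixed = np.argsort(labels, kind='stable')[::-1]
print(fixed)         # [3 1 2 0] - each group's rows, in reverse order
```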

out[idx, 0] == {{nan_val}}
filled_vals += 1
else: # reset items when not missing
filled_vals = 0
Contributor:

Won't filled_vals need to be tracked per group? Consider this case (I haven't tested it on your PR):

df = pd.DataFrame({'k': [1, 1, 2, 2, 1, 1],
                   'v': [1.1, np.nan, 1, np.nan, np.nan, np.nan]})

df.groupby('k').ffill(limit=2)
Out[25]:
   k    v
0  1  1.1
1  1  1.1
2  2  1.0
3  2  1.0
4  1  1.1
5  1  NaN

Member Author:

It needs to be tracked by both group and value. One thing to keep in mind is that the main loop in the Cython function doesn't go sequentially over the values in the series / frame, but rather iterates over the argsorted labels, which makes it easy to keep track of groups. Just ran your example on my PR and it gave the result you have above.

That said, I don't have a problem adding a test case that mixes the groups up to prevent this from breaking in the future. Can bundle that in with the next commit

Member Author:

The test case for this is now a little more complicated, handling both sequential groups (ex: ['a', 'a', 'a', 'b', 'b', 'b']) and "interwoven" (ex: ['a', 'b', 'a', 'b', 'a', 'b']). Open to splitting it up into separate tests if you think that would make it more readable.

Otherwise this instance should be covered. I fixed a bug in the Cython code to go along with it - thanks for the callout!
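A test along those lines - the same fill applied to sequential and interwoven group labels - might look like the following (the data values here are chosen for illustration; the actual test in the PR may differ):

```python
import numpy as np
import pandas as pd

# Sequential groups: ['a', 'a', 'a', 'b', 'b', 'b']
df_seq = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b', 'b'],
                       'val': [1.0, np.nan, np.nan, 2.0, np.nan, np.nan]})

# Interwoven groups: ['a', 'b', 'a', 'b', 'a', 'b']
df_mix = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a', 'b'],
                       'val': [1.0, 2.0, np.nan, np.nan, np.nan, np.nan]})

res_seq = df_seq.groupby('key')['val'].ffill(limit=1)
res_mix = df_mix.groupby('key')['val'].ffill(limit=1)

# With limit=1, only the first NaN after a valid value is filled per group,
# regardless of how the groups are interleaved in the frame
print(res_seq.tolist())  # [1.0, 1.0, nan, 2.0, 2.0, nan]
print(res_mix.tolist())  # [1.0, 2.0, 1.0, 2.0, nan, nan]
```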

@@ -1472,7 +1479,7 @@ def pad(self, limit=None):
Series.fillna
DataFrame.fillna
"""
return self.apply(lambda x: x.ffill(limit=limit))
return self.apply('ffill', limit=limit)
Contributor:

Rather than calling apply, the pattern so far has been to define the method directly, e.g. GroupBy.cumsum, etc.

Member Author:

Tied to the comment below - I added _cython_apply as its own function to handle dispatching back to the Cython layer; this apply tries it first before falling back to the current method of iterating over the groups and applying the method directly from there. Items like GroupBy.cumsum call _cython_transform in a similar fashion, which I was trying to emulate here, but I was hesitant to send these methods down the transform route because of the complexity that would cause there.

@@ -2032,6 +2039,38 @@ def _get_group_keys(self):
self.levels,
self.labels)

def _cython_apply(self, ftype, data, axis, **kwargs):
Contributor:

I think all this should pass down to _cython_transform?

Member Author:

I was a little torn over this. The call signature doesn't play all that nicely with the other methods utilizing _cython_transform. Specifically, the is_numeric and is_datetimelike arguments aren't applicable and there are some conditionals that need to be added to prevent accidental casting of data during the fill. On top of that, we'd have to wrap the transformation to be sure to include the grouped columns as none of the other transformations do that.

Touched on this with @jreback in the comments of #19481 - there for sure needs to be a pretty comprehensive refactoring on groupby.py to more effectively dispatch Cython operations. For the time being I thought this approach would be the "best of the worst" option, but can take another look at wiring into transform if we feel that is a show stopper

Contributor:

I don't disagree the whole thing needs some refactoring, xref also to #19354 - I'll think about it a bit more too

@WillAyd changed the title from "Grp fill perf" to "Cythonized GroupBy Fill" on Feb 13, 2018
@chris-b1 (Contributor):

One possibility to clean this up would be to replicate GroupBy.shift

def shift(self, periods=1, freq=None, axis=0):

Rather than type-specific functions, there the Cython routine computes a single indexer, and then the normal take functions are used to produce the actual output.

def group_shift_indexer(int64_t[:] out, int64_t[:] labels,

Probably a bit slower than what you do here, but I would be surprised if it's bad.
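The shift-style design the comment describes - a routine that computes only an int64 indexer, with the normal take machinery producing the output - can be sketched in pure Python/NumPy. The helper name and the dict-based group tracking here are illustrative, not the actual Cython implementation:

```python
import numpy as np

def group_ffill_indexer(labels, mask, limit=-1):
    """For each row, the index of the value to take, or -1 if nothing fillable."""
    out = np.empty(len(labels), dtype=np.int64)
    last = {}    # group label -> index of last non-missing value
    filled = {}  # group label -> consecutive fills since last non-missing value
    for i, (lab, is_na) in enumerate(zip(labels, mask)):
        if not is_na:
            last[lab] = i
            filled[lab] = 0
            out[i] = i
        elif lab in last and (limit < 0 or filled[lab] < limit):
            filled[lab] += 1
            out[i] = last[lab]
        else:
            out[i] = -1  # leading NaN in group, or limit exhausted
    return out

values = np.array([1.0, np.nan, np.nan, 2.0, np.nan])
labels = np.array([0, 0, 0, 1, 1])
idx = group_ffill_indexer(labels, np.isnan(values), limit=1)
# idx is [0, 0, -1, 3, 3]; a take then materializes the filled values
result = np.where(idx >= 0, values[np.maximum(idx, 0)], np.nan)
# result: 1.0, 1.0, nan, 2.0, 2.0
```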

@WillAyd (Member Author) commented Feb 13, 2018

Thanks for the idea. I think I could make something like that work here and that would definitely simplify things

@WillAyd (Member Author) commented Feb 13, 2018

Just simplified the code to mirror what we do for shift. The groupby.py changes are MUCH cleaner now and there doesn't appear to be any significant change to benchmarks either. Updated results below:

       before           after         ratio
     [07137a5a]       [0939aeae]
+           983ms            1.23s     1.25  groupby.GroupByMethods.time_method('float', 'pct_change')
+         356±4μs         443±20μs     1.24  groupby.GroupByMethods.time_method('float', 'prod')
+      84.6±0.2μs          101±3μs     1.19  groupby.GroupByMethods.time_method('int', 'count')
+         185±1μs          218±3μs     1.18  groupby.GroupByMethods.time_method('int', 'cumcount')
+         161±3μs          185±3μs     1.15  groupby.GroupByMethods.time_method('float', 'cumcount')
+         334±9μs         382±20μs     1.14  groupby.GroupByMethods.time_method('float', 'median')
+         461±3μs         523±10μs     1.13  groupby.GroupByMethods.time_method('int', 'cummax')
+       344±0.8μs         387±10μs     1.12  groupby.GroupByMethods.time_method('float', 'nunique')
+      64.5±0.2μs       72.1±0.9μs     1.12  groupby.GroupByMethods.time_method('int', 'size')
+         456±2μs         504±20μs     1.11  groupby.GroupByMethods.time_method('float', 'cummin')
+         133±5ms          146±3ms     1.10  groupby.GroupByMethods.time_method('int', 'any')
-        769±20μs         684±30μs     0.89  groupby.GroupByMethods.time_method('int', 'cumprod')
-       893±100μs         598±30μs     0.67  groupby.GroupByMethods.time_method('int', 'sem')
-         145±2ms          275±7μs     0.00  groupby.GroupByMethods.time_method('int', 'bfill')
-         154±3ms          245±1μs     0.00  groupby.GroupByMethods.time_method('int', 'ffill')
-         228±5ms         270±10μs     0.00  groupby.GroupByMethods.time_method('float', 'bfill')
-         237±5ms        232±0.5μs     0.00  groupby.GroupByMethods.time_method('float', 'ffill')

@gfyoung added the Groupby, Missing-data, and Performance labels on Feb 14, 2018
@chris-b1 (Contributor) left a comment:

looks pretty good


N = len(out)

sorted_labels = np.argsort(labels).view(dtype=np.int64)
Contributor:

copy as in #19701

if limit is None:
limit = -1
output = {}
if type(self) is DataFrameGroupBy:
Contributor:

What's this if branch for?

Member Author:

Fill is unique in that, unlike say transformations, the column(s) used in the grouping are also returned as part of the output for a DataFrame object. _iterate_slices does not go over those items and I didn't see an existing method to handle that, so this conditional makes sure the grouped columns still make their way into the frame that gets returned.

Member Author:

FWIW I was wondering whether it's worth adding an instance method to the GroupBy object to handle this (_iterate_groups or something to that effect), but I couldn't think of any other function that requires this feature.

Contributor:

rather than doing the if type(...), make a method in SeriesGroupBy and DataFrameGroupBy

codecov bot commented Feb 15, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@feedf66).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master   #19673   +/-   ##
=========================================
  Coverage          ?   91.65%           
=========================================
  Files             ?      150           
  Lines             ?    48962           
  Branches          ?        0           
=========================================
  Hits              ?    44875           
  Misses            ?     4087           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.03% <100%> (?)
#single 41.8% <17.24%> (?)
Impacted Files Coverage Δ
pandas/core/groupby.py 92.31% <100%> (ø)


['all', 'any', 'bfill', 'count', 'cumcount', 'cummax', 'cummin',
'cumprod', 'cumsum', 'describe', 'ffill', 'first', 'head',
'last', 'mad', 'max', 'min', 'median', 'mean', 'nunique',
'pct_change', 'prod', 'rank', 'sem', 'shift', 'size', 'skew',
Contributor:

Do we bench the compat methods for datetimelikes (timestamp / timedelta)? Most of these would work (except prod, mean, mad, pct_change and a few more), though those likely work for timedeltas. OK with adding an issue to do this as well (IOW it doesn't need to be in this PR).

Member Author:

See #19733

@@ -646,6 +646,7 @@ Performance Improvements
- Improved performance of pairwise ``.rolling()`` and ``.expanding()`` with ``.cov()`` and ``.corr()`` operations (:issue:`17917`)
- Improved performance of :func:`DataFrameGroupBy.rank` (:issue:`15779`)
- Improved performance of variable ``.rolling()`` on ``.min()`` and ``.max()`` (:issue:`19521`)
- Improved performance of :func:`GroupBy.ffill` and :func:`GroupBy.bfill` (:issue:`11296`)
Contributor:

I am not sure this will render. You can just do .groupby().ffill() and so on.

with nogil:
for i in range(N):
idx = sorted_labels[i]
if mask[idx] == 1: # is missing
Contributor:

curr_fill_idx seems potentially not defined (if it is missing but the limit is not hit), not sure if that is possible. maybe could initialize curr_fill_idx = -1?

Member Author:

The initialization occurs at declaration. Can move that into the body of the function if you feel that is more readable

Contributor:

ahh see that ok then

int64_t curr_fill_idx=-1
int64_t idx, filled_vals=0

N = len(out)
Contributor:

maybe want an assert that N == len(labels) == len(mask)

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)
mask = isnull(obj.values).view(np.uint8)
libgroupby.group_fillna_indexer(indexer, mask, labels, how,
Contributor:

maybe make a routine that shares code with how the shift_indexer is called (I am talking about in python space here)

curr_fill_idx = idx

out[idx] = curr_fill_idx
# If we move to the next group, reset
Contributor:

blank line here

@WillAyd (Member Author) commented Feb 19, 2018

Latest commit includes a shared function between fill and shift, which can theoretically house any / all implementations with a little clean up as well. As touched on in previous conversations, if we expand this it should probably be in its own module

Open to review and plan to change the location of the func within the module / update docs, but I wanted to push because this commit will cause tests to fail at the following location:

tm.assert_frame_equal(expected, getattr(gb, op)(*args))

The reason this is failing is because I replaced output = {} in the original shift code with output = collections.OrderedDict() in the newly shared function. I'm not entirely clear on why that would impact the test case, but it seems like a more explicit code path regardless.

I believe the test case is wrong anyway and the failing line should either have a .sort_index call or be placed before the expected = expected.sort_index(axis=1) call a few lines ahead of it, but wanted to get your input and make sure I am not misreading that

@jreback (Contributor) commented Feb 19, 2018

The reason this is failing is because I replaced output = {} in the original shift code with output = collections.OrderedDict() in the newly shared function. I'm not entirely clear on why that would impact the test case, but it seems like a more explicit code path regardless.

It's probably happenstance that this worked before - IOW it happened to be sorted already. Go ahead and add what is needed to make it pass.

(And FYI, make this an expected =, rather than code in the assert_frame_equal block.)

@jreback (Contributor) left a comment:

Any reason you want this to be a separate module at this point? It looks OK here.

-------
GroupBy object populated with appropriate result(s)
"""
exp_kwds = collections.OrderedDict([
Contributor:

These should be passed in directly (from the calling function) rather than hard-coded inside this function, no?

Contributor:

Right - maybe the caller passes in a partial with the kwargs baked in, rather than the string version of the function?

_get_cythonized_result(
    partial(pd._libs.groupby.group_shift_indexer, nperiods=2),
    ...
)

Member Author (Feb 19, 2018):

I'll think more about that. The one complicating factor is that masked values cannot be passed in directly as they are calculated on a per-slice basis. To avoid a convoluted set of calls I figured it would make sense to have all keywords and positional arguments resolved within one function, but there are certainly other ways to go about this.

Contributor:

@chris-b1 comment is about the exp_kwds, I am still not clear why the caller cannot pass these

if needs_ngroups:
func = partial(func, ngroups)

# Convert any keywords into positional arguments
Contributor:

you can pass kwargs to cython functions, so not sure this is necessary

Member Author:

I explicitly converted kwargs into positional arguments from the first comment back in #19481. If we are OK with kwargs in the Cython layer I can rewrite this, as it could simplify a few things

Contributor:

yes, you can pass kwargs

base_func = getattr(libgroupby, how)

for name, obj in self._iterate_slices():
indexer = np.zeros_like(labels)
Contributor:

do we need to specify int64? (or platform int)?

Contributor:

yes, int64

Member Author:

This was a copy/paste of existing code, so I didn't give it too much thought. That said, labels is already wrapped with an _ensure_int64 call as part of group_info within the BaseGrouper, so unless there's something in particular you know of, I figure it would be easiest to inherit the type from labels and not override it.

def _fill(self, direction, limit=None):
"""Overridden method to concat grouped columns in output"""
res = super()._fill(direction, limit=limit)
output = collections.OrderedDict()
Contributor:

would be slightly simpler as a list-comprehension

Contributor:

though maybe this should be an arg to _fill? (IOW the option to do it should be done in fill)

Member Author:

Do you mean as an arg to _get_cythonized_result (fill would always use it, unless we wanted to implement a new feature)? I was thinking about that, but given that fill on a DataFrameGroupBy is the only operation I know of at the moment that includes the grouping in the body of the returned object, I figured it made the most sense to just perform that action there.
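For illustration, the behavior under discussion - the grouped column(s) appearing in the filled DataFrame output - can be approximated in plain pandas; the concat below is a simplified stand-in for what the override does, not the PR's actual code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'k': [1, 1, 2], 'v': [1.0, np.nan, 2.0]})
filled = df.groupby('k').ffill()

# Depending on the pandas version, the grouping column may or may not be
# included in the output; re-attach it if it was dropped
if 'k' not in filled.columns:
    filled = pd.concat([df[['k']], filled], axis=1)

print(filled['v'].tolist())  # [1.0, 1.0, 2.0]
```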

indexer = np.zeros_like(labels)
func = partial(base_func, indexer, labels)
if needs_mask:
mask = isnull(obj.values).astype(np.uint8, copy=False)
Contributor:

Contrary to the int64 case, you do want to use view here - numpy doesn't see bool and uint8 as the 'same' type, so will always copy, which view elides.

Member Author:

Thanks for the heads up - still trying to wrap my head around some of the nuances there. Out of curiosity, how do you know this? Just from experience or is there a good reference?

I do see that astype documentation mentions dtype, order and subok requirements need to be satisfied to prevent copy, but I'm not clear on what those requirements are and if they are documented

Contributor:

Unfortunately all I can really point to is trial and error. Some "general" rules, at least in a pandas context

  1. copy=False will only elide a copy if the dtype is identical (there may be contradictions to this, but at least not in general). Useful for conditional casts.
  2. arr.view(dtype) never copies. Viewing across types should only be used if the itemsizes of the two types are identical. Primarily useful for casting a type with some metadata down to a more primitive type (bool to uint8, datetime64 to int64).
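Both rules can be checked directly with np.shares_memory:

```python
import numpy as np

arr = np.array([True, False, True])

v = arr.view(np.uint8)                # reinterprets the same buffer - never copies
a = arr.astype(np.uint8, copy=False)  # dtypes differ (bool vs uint8), so numpy copies anyway

print(np.shares_memory(arr, v))  # True
print(np.shares_memory(arr, a))  # False

same = arr.astype(bool, copy=False)   # identical dtype - the copy is elided
print(np.shares_memory(arr, same))    # True
```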

@WillAyd (Member Author) commented Feb 19, 2018

To clarify my point about having a separate module for _get_cythonized_result: I don't think that method as a stand-alone needs its own module, but thinking long term, and going back to the discussion in #19481, a class to more effectively dispatch all of the functions (agg and transform included) may make sense as a sub-module of groupby.py, and _get_cythonized_result could potentially evolve into that.

@jreback (Contributor) commented Feb 19, 2018

@WillAyd ahh perfect - yes, absolutely in favor of a class dispatch mechanism, and it should be in its own cython/python modules as needed. OK with merging this (small fix-ups) then attacking that (any/all can be in the current or new framework).

pep8speaks commented Feb 19, 2018

Hello @WillAyd! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 24, 2018 at 16:41 UTC

@@ -1457,6 +1457,15 @@ def expanding(self, *args, **kwargs):
from pandas.core.window import ExpandingGroupby
return ExpandingGroupby(self, *args, **kwargs)

def _fill(self, direction, limit=None):
# Need int value for Cython
Contributor:

can you add a doc-string

int64_t lab, idxer, idxer_slot
int64_t[:] label_seen = np.zeros(ngroups, dtype=np.int64)
int64_t[:, :] label_indexer

periods = kwargs['periods']
Contributor:

Hmm - if you are not doing a .get() here, is there a reason you are not simply having periods=None in the signature (if it's optional)?

ndarray[int64_t] sorted_labels
int64_t limit, idx, curr_fill_idx=-1, filled_vals=0

direction = kwargs['direction']
Contributor:

Same again - why not simply list them in the signature? (You can have **kwargs if you want to ignore other ones.)

Member Author:

I didn't want to define a default value for something like periods in both the Python and Cython layers, preferring to let the former set the default and be responsible for passing it to Cython.

It's a pretty trivial change so will re-push with your changes reflected soon

Member Author:

Think I've been looking at this change too long...planning on making these required in the Cython signature and using the existing defaults in the Python layer to populate those via **kwargs. If we are not on the same page let me know

Contributor:

no that sounds good

@WillAyd (Member Author) commented Feb 24, 2018

Travis failure on latest commit looks unrelated (something with pyarrow?)

@jreback jreback merged commit d87ca1c into pandas-dev:master Feb 25, 2018
@jreback (Contributor) commented Feb 25, 2018

thanks @WillAyd very nice as always!

Merging this pull request closes: PERF: groupby-fillna perf, implement in cython